Method for synthesizing speech

ABSTRACT

The present invention relates to a method for analyzing speech, the method comprising the steps of: a) inputting a speech signal, b) obtaining the first harmonic of the speech signal, c) determining the phase-difference Δφ between the speech signal and the first harmonic.

FIELD OF THE INVENTION

The present invention relates to the field of analyzing and synthesizing speech and, more particularly and without limitation, to the field of text-to-speech synthesis.

BACKGROUND AND PRIOR ART

The function of a text-to-speech (TTS) synthesis system is to synthesize speech from a generic text in a given language. Nowadays, TTS systems have been put into practical operation for many applications, such as access to databases through the telephone network or aid to handicapped people. One method to synthesize speech is by concatenating elements of a recorded set of subunits of speech such as demisyllables or polyphones. The majority of successful commercial systems employ the concatenation of polyphones. The polyphones comprise groups of two (diphones), three (triphones) or more phones and may be determined from nonsense words, by segmenting the desired grouping of phones at stable spectral regions. In a concatenation based synthesis, the conservation of the transition between two adjacent phones is crucial to assure the quality of the synthesized speech. With the choice of polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones.

Before the synthesis, however, the phones must have their duration and pitch modified in order to fulfill the prosodic constraints of the new words containing those phones. This processing is necessary to avoid the production of a monotonous sounding synthesized speech. In a TTS system, this function is performed by a prosodic module. To allow the duration and pitch modifications in the recorded subunits, many concatenation based TTS systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) model of synthesis (E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 453-467, 1990).

In the TD-PSOLA model, the speech signal is first submitted to a pitch marking algorithm. This algorithm assigns marks at the peaks of the signal in the voiced segments and assigns marks 10 ms apart in the unvoiced segments. The synthesis is made by a superposition of Hanning windowed segments centered at the pitch marks and extending from the previous pitch mark to the next one. The duration modification is provided by deleting or replicating some of the windowed segments. The pitch period modification, on the other hand, is provided by increasing or decreasing the superposition between windowed segments.

Despite the success achieved in many commercial TTS systems, the synthetic speech produced by using the TD-PSOLA model of synthesis can present some drawbacks, mainly under large prosodic variations, outlined as follows.

-   1. The pitch modifications introduce a duration modification that needs to be appropriately compensated.
-   2. The duration modification can only be implemented in a quantized manner, with a one pitch period resolution (α = ..., 1/2, 2/3, 3/4, ..., 4/3, 3/2, 2/1, ...).
-   3. When performing a duration enlargement in unvoiced portions, the repetition of the segments can introduce “metallic” artifacts (metallic-like sounding of the synthesized speech).

In IEEE Transactions on Speech and Audio Processing, vol. 6, No. 5, September 1998, “A Hybrid Model for Text-to-Speech Synthesis”, Fábio Violaro and Olivier Böeffard, a hybrid model for concatenation-based text-to-speech synthesis is described.

The speech signal is submitted to a pitch-synchronous analysis and decomposed into a harmonic component, with a variable maximum frequency, plus a noise component. The harmonic component is modeled as a sum of sinusoids with frequencies multiple of the pitch. The noise component is modeled as a random excitation applied to an LPC filter. In unvoiced segments, the harmonic component is made equal to zero. In the presence of pitch modifications, a new set of harmonic parameters is evaluated by resampling the spectrum envelope at the new harmonic frequencies. For the synthesis of the harmonic component in the presence of duration and/or pitch modifications, a phase correction is introduced into the harmonic parameters.

A variety of other so called “overlap and add” methods are known from the prior art, such as PIOLA (Pitch Inflected OverLap and Add) [P. Meyer, H. W. Rüh, R. Krüger, M. Kugler, L. L. M. Vogten, A. Dirksen, and K. Belhoula. PHRITTS: A text-to-speech synthesizer for the German language. In Eurospeech '93, pages 877-890, Berlin, 1993], or PICOLA (Pointer Interval Controlled OverLap and Add) [Morita: “A study on speech expansion and contraction on time axis”, Master thesis, Nagoya University (1987), in Japanese]. These methods differ from each other in the way they mark the pitch period locations.

None of these methods gives satisfactory results when applied as a mixer for two different waveforms. The problem is phase mismatches. The phases of harmonics are affected by the recording equipment, room acoustics, distance to the microphone, vowel color, co-articulation effects etc. Some of these factors, like the recording environment, can be kept unchanged, but others, like the co-articulation effects, are very difficult (if not impossible) to control. The result is that when pitch period locations are marked without taking into account the phase information, the synthesis quality will suffer from phase mismatches.

Other methods like MBR-PSOLA (Multi Band Resynthesis Pitch Synchronous OverLap Add) [T. Dutoit and H. Leich. MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database. Speech Communication, 1993] regenerate the phase information to avoid phase mismatches. But this involves an extra analysis-synthesis operation that reduces the naturalness of the generated speech. The synthesis often sounds mechanical.

U.S. Pat. No. 5,787,398 shows an apparatus for synthesizing speech by varying pitch. One of the disadvantages of this approach is that since the pitch marks are centered on the excitation peaks and the measured excitation peak does not necessarily have synchronous phase, phase distortion results.

The pitch of synthesized speech signals is varied by separating the speech signals into a spectral component and an excitation component. The latter is multiplied by a series of overlapping window functions synchronous, in the case of voiced speech, with pitch timing mark information corresponding at least approximately to instants of vocal excitation, to separate it into windowed speech segments which are added together again after the application of a controllable time-shift. The spectral and excitation components are then recombined. The multiplication employs at least two windows per pitch period, each having a duration of less than one pitch period.

U.S. Pat. No. 5,081,681 shows a class of methods and related technology for determining the phase of each harmonic from the fundamental frequency of voiced speech.

Applications include speech coding, speech enhancement, and time scale modification of speech. The basic approach includes recreating phase signals from fundamental frequency and voiced/unvoiced information, and adding a random component to the recreated phase signal to improve the quality of the synthesized speech.

U.S. Pat. No. 5,081,681 describes a method for phase synthesis for speech processing. Since the phase is synthetic, the result of the synthesis does not sound natural, as many aspects of the human voice and the acoustics of the surroundings are ignored by the synthesis.

SUMMARY OF THE INVENTION

The present invention provides a method for analyzing speech, in particular natural speech. The method for analyzing speech in accordance with the invention is based on the discovery that the phase difference between the speech signal, in particular a diphone speech signal, and the first harmonic of the speech signal is a speaker dependent parameter which is basically a constant for different diphones.

In accordance with a preferred embodiment of the invention this phase difference is obtained by determining a maximum of the speech signal and by determining the phase zero, i.e. the positive zero crossing of the first harmonic. The difference between the phases of the maximum and phase zero is the speaker dependent phase difference parameter.

In one application this parameter serves as a basis to determine a window function, such as a raised cosine or a triangular window. Preferably the window function is centered on the phase angle which is given by the zero phase of the first harmonic plus the phase difference. Preferably the window function has its maximum at that phase angle. For example, the window function is chosen to be symmetric with respect to that phase angle.

For speech synthesis diphone samples are windowed by means of the window function, whereby the window function and the diphone sample to be windowed are offset by the phase difference.

The diphone samples which are windowed this way are concatenated. This way the natural phase information is preserved such that the result of the speech synthesis sounds quasi natural.

In accordance with a preferred embodiment of the invention control information is provided which indicates diphones and a pitch contour. For example such control information can be provided by the language processing module of a text-to-speech system.

It is a particular advantage of the present invention in comparison to other time domain overlap and add methods that the pitch period (or the pitch-pulse) locations are synchronized by the phase of the first harmonic.

The phase information can be retrieved by low-pass filtering the original speech signal to obtain its first harmonic and using the positive zero-crossings as indicators of zero phase. This way, the phase discontinuity artifacts are avoided without changing the original phase information.

Applications for the speech synthesis methods and the speech synthesis device of the invention include: telecommunication services, language education, aid to handicapped persons, talking books and toys, vocal monitoring, multimedia, man-machine communication.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, preferred embodiments of the invention are described in greater detail by making reference to the drawings in which:

FIG. 1 is illustrative of a flow chart of a method to determine the phase difference between a diphone and its first harmonic,

FIG. 2 is illustrative of signal diagrams to illustrate an example of the application of the method of FIG. 1,

FIG. 3 is illustrative of an embodiment of the method of the invention for synthesizing speech,

FIG. 4 shows an application example of the method of FIG. 3,

FIG. 5 is illustrative of an application of the invention for processing of natural speech,

FIG. 6 is illustrative of an application of the invention for text-to-speech,

FIG. 7 is an example of a file containing phonetic information,

FIG. 8 is an example of a file containing diphone information extracted from the file of FIG. 7,

FIG. 9 is illustrative of the result of a processing of the files of FIGS. 7 and 8,

FIG. 10 shows a block diagram of a speech analysis and synthesis apparatus in accordance with the present invention.

DETAILED DESCRIPTION

The flow chart of FIG. 1 is illustrative of a method for speech analysis in accordance with the present invention. In step 101 natural speech is inputted. For the input of natural speech known training sequences of nonsense words can be utilized. In step 102 diphones are extracted from the natural speech. The diphones are cut from the natural speech and consist of the transition from one phoneme to the other.

In the next step 103 at least one of the diphones is low-pass filtered to obtain the first harmonic of the diphone. This first harmonic is a speaker dependent characteristic which can be kept constant during the recordings.

In step 104 the phase difference between the first harmonic and the diphone is determined. Again this phase difference is a speaker specific voice parameter. This parameter is useful for speech synthesis as will be explained in more detail with respect to FIGS. 3 to 10.

FIG. 2 is illustrative of one method to determine the phase difference between the first harmonic and the diphone (cf. step 104 of FIG. 1). A sound wave 201 acquired from natural speech forms the basis for the analysis. The sound wave 201 is low-pass filtered with a cut-off frequency of about 150 Hz in order to obtain the first harmonic 202 of the sound wave 201. The positive zero-crossings of the first harmonic 202 define the phase angle zero. The first harmonic 202 as depicted in FIG. 2 covers 19 successive complete periods. In the example considered here the duration of the periods slightly increases from period 1 to period 19. For one of the periods the local maximum of the sound wave 201 within that period is determined.

For example the local maximum of the sound wave 201 within the period 1 is the maximum 203. The phase of the maximum 203 within the period 1 is denoted as φ_(max) in FIG. 2. The difference Δφ between φ_(max) and the zero phase φ₀ of the period 1 is a speaker dependent speech parameter. In the example considered here this phase difference is about 0.3 π. It is to be noted that this phase difference is about constant irrespective of which one of the maxima is utilized in order to determine this phase difference. It is however preferable to choose a period with a distinctive maximum energy location for this measurement. For example if the maximum 204 within the period 9 is utilized to perform this analysis the resulting phase difference is about the same as for the period 1.
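The measurement described for FIG. 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the function names, the 401-tap Hamming-windowed sinc filter, the 8 kHz sampling rate in the usage lines and the choice of the middle period are all assumptions made for the example. The text itself only specifies a low-pass cut-off of about 150 Hz, positive zero-crossings as phase zero, and the local maximum of the sound wave within one period. The sketch also assumes that the cut-off lies between the fundamental and the second harmonic.

```python
import math


def lowpass_taps(cutoff_hz, fs, num_taps=401):
    """Hamming-windowed sinc low-pass filter (assumed design, odd length)."""
    mid = num_taps // 2
    fc = cutoff_hz / fs  # cut-off in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # unity gain at 0 Hz


def first_harmonic(signal, fs, cutoff_hz=150.0):
    """Low-pass filter the signal. The symmetric taps are applied centered,
    so the filter adds no delay and zero-crossings keep their position."""
    taps = lowpass_taps(cutoff_hz, fs)
    mid = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + j - mid
            if 0 <= idx < len(signal):
                acc += t * signal[idx]
        out.append(acc)
    return out


def positive_zero_crossings(x):
    """Fractional sample positions where x passes from negative to non-negative."""
    marks = []
    for i in range(1, len(x)):
        if x[i - 1] < 0.0 <= x[i]:
            marks.append(i - 1 + (-x[i - 1]) / (x[i] - x[i - 1]))
    return marks


def phase_difference(signal, fs, cutoff_hz=150.0):
    """Return the phase difference in radians, measured in the middle period."""
    marks = positive_zero_crossings(first_harmonic(signal, fs, cutoff_hz))
    if len(marks) < 2:
        raise ValueError("need at least one complete period")
    k = (len(marks) - 1) // 2  # middle period, away from filter edge effects
    start, end = marks[k], marks[k + 1]
    candidates = range(math.ceil(start), math.floor(end) + 1)
    peak = max(candidates, key=lambda i: signal[i])  # local maximum of the wave
    return 2.0 * math.pi * (peak - start) / (end - start)


if __name__ == "__main__":
    fs, f0 = 8000, 100.0  # hypothetical test tone, not data from FIG. 2
    wave = []
    for n in range(1600):
        th = 2.0 * math.pi * f0 * n / fs + 0.4
        wave.append(math.sin(th) + 0.5 * math.sin(2 * th) + 0.3 * math.sin(3 * th))
    print(phase_difference(wave, fs) / math.pi, "pi")
```

The time distance between the zero-crossing and the maximum is divided by the length of that period, so the result is a phase angle rather than a time offset and remains comparable when the period length varies.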

FIG. 3 is illustrative of an application of the speech synthesis method of the invention. In step 301 diphones which have been obtained from natural speech are windowed by a window function which has its maximum at φ₀+Δφ; for example a raised cosine which is centered with respect to the phase φ₀+Δφ can be chosen.

This way pitch bells of the diphones are provided in step 302. In step 303 speech information is inputted. This can be information which has been obtained from natural speech or from a text-to-speech system, such as the language processing module of such a text-to-speech system.

In accordance with the speech information pitch bells are selected. For instance the speech information contains information of the diphones and of the pitch contour to be synthesized. In this case the pitch bells are selected accordingly in step 304 such that the concatenation of the pitch bells in step 305 results in the desired speech output in step 306.

An application of the method of FIG. 3 is illustrated by way of example in FIG. 4. FIG. 4 shows a sound wave 401 which consists of a number of diphones. The analysis as explained with respect to FIGS. 1 and 2 above is applied to the sound wave 401 in order to obtain the zero phase φ₀ for each of the pitch intervals. As in the example of FIG. 2 the zero phase φ₀ is offset from the phase φ_(max) of the maximum within the pitch interval by a phase angle of Δφ which is about constant.

A raised cosine 402 is used to window the sound wave 401. The raised cosine 402 is centered with respect to the phase φ₀+Δφ. Windowing of the sound wave 401 by means of the raised cosine 402 provides successive pitch bells 403. This way the diphone waveforms of the sound wave 401 are split into such successive pitch bells 403. The pitch bells 403 are obtained from two neighboring periods by means of the raised cosine which is centered on the phase φ₀+Δφ. An advantage of utilizing a raised cosine rather than a rectangular function is that the edges are smooth this way. It is to be noted that this operation is reversible by overlapping and adding all of the pitch bells 403 in the same order; this produces about the original sound wave 401.
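The reversibility noted above can be illustrated with a short Python sketch. It is a minimal example under assumptions, not the implementation of the invention: the function names are hypothetical, the period locations (the positions φ₀+Δφ) are assumed to be already available as integer sample indices in `marks`, and each window is built from two raised-cosine halves so that it can follow slightly varying period lengths.

```python
import math


def pitch_bell(signal, left, center, right):
    """Window signal[left:right] with a raised cosine peaking at `center`.

    Returns (start_index, samples). The rising half spans left..center and
    the falling half spans center..right (two neighboring periods)."""
    bell = []
    for n in range(left, right):
        if n < center:
            w = 0.5 * (1.0 - math.cos(math.pi * (n - left) / (center - left)))
        else:
            w = 0.5 * (1.0 + math.cos(math.pi * (n - center) / (right - center)))
        bell.append(w * signal[n])
    return left, bell


def split_into_bells(signal, marks):
    """One pitch bell per interior mark, each covering its two neighbors."""
    return [pitch_bell(signal, marks[i - 1], marks[i], marks[i + 1])
            for i in range(1, len(marks) - 1)]


def overlap_add(bells, length):
    """Add all bells back at their original positions."""
    out = [0.0] * length
    for start, bell in bells:
        for k, value in enumerate(bell):
            if 0 <= start + k < length:
                out[start + k] += value
    return out


if __name__ == "__main__":
    # Hypothetical waveform and period locations, for illustration only.
    wave = [math.sin(0.07 * n) + 0.3 * math.cos(0.31 * n) for n in range(420)]
    marks = [0, 80, 161, 243, 320, 400]
    rebuilt = overlap_add(split_into_bells(wave, marks), len(wave))
    error = max(abs(rebuilt[n] - wave[n]) for n in range(marks[1], marks[-2]))
    print("largest reconstruction error between interior marks:", error)
```

Between two neighboring marks the falling half of one window and the rising half of the next add up to one, which is why overlapping and adding the bells in their original order gives back the waveform in that region.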

The duration of the sound wave 401 can be changed by repeating or skipping pitch bells 403, and the pitch can be changed by moving the pitch bells 403 towards or away from each other. The sound wave 404 is synthesized this way by repeating the same pitch bell 403 with a smaller spacing than the original one in order to increase the original pitch of the sound wave 401. It is to be noted that the phases remain intact as a result of this overlapping operation because of the prior window operation which has been performed taking into account the characteristic phase difference Δφ. This way pitch bells 403 can be utilized as building blocks in order to synthesize quasi-natural speech.

FIG. 5 illustrates one application for processing of natural speech. In step 501 natural speech of a known speaker is inputted. This corresponds to inputting of a sound wave 401 as depicted in FIG. 4. The natural speech is windowed by the raised cosine 402 (cf. FIG. 4) or by another suitable window function which is centered with respect to the phase φ₀+Δφ.

This way the natural speech is decomposed into pitch bells (cf. pitch bell 403 of FIG. 4) which are provided in step 503.

In step 504 the pitch bells provided in step 503 are utilized as “building blocks” for speech synthesis. One way of processing is to leave the pitch bells as such unchanged but to leave out certain pitch bells or to repeat certain pitch bells. For example if every fourth pitch bell is left out this shortens the duration of the speech by 25% without otherwise altering the sound of the speech. Likewise the speech speed can be decreased by repeating certain pitch bells.

Alternatively or in addition the distance between the pitch bells is modified in order to increase or decrease the pitch.

In step 505 the processed pitch bells are overlapped in order to produce a synthetic speech waveform which sounds quasi natural.
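The processing of steps 504 and 505 can be sketched as follows. This is a simplified illustration with hypothetical function names, assuming that each pitch bell is a plain list of samples and that all bells are laid out with one constant spacing; a real pitch contour would use a varying spacing.

```python
import math


def drop_every_nth(bells, n):
    """Leave out every n-th pitch bell (shortens the duration)."""
    return [b for i, b in enumerate(bells, start=1) if i % n != 0]


def repeat_every_nth(bells, n):
    """Repeat every n-th pitch bell (lengthens the duration)."""
    out = []
    for i, b in enumerate(bells, start=1):
        out.append(b)
        if i % n == 0:
            out.append(b)
    return out


def overlap_add_at(bells, spacing):
    """Overlap and add the bells with their starts `spacing` samples apart.

    A smaller spacing raises the pitch, a larger spacing lowers it."""
    if not bells:
        return []
    length = spacing * (len(bells) - 1) + max(len(b) for b in bells)
    out = [0.0] * length
    for i, bell in enumerate(bells):
        for k, value in enumerate(bell):
            out[i * spacing + k] += value
    return out


if __name__ == "__main__":
    # Eight identical raised-cosine bells of 160 samples (hypothetical data).
    bell = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / 160)) for k in range(160)]
    bells = [bell] * 8
    shorter = overlap_add_at(drop_every_nth(bells, 4), 80)
    longer = overlap_add_at(repeat_every_nth(bells, 4), 80)
    higher = overlap_add_at(bells, 64)  # same bells, closer together
    print(len(shorter), len(longer), len(higher))
```

Dropping or repeating bells changes only how many periods are played, while the spacing passed to `overlap_add_at` sets the period length, so duration and pitch can be adjusted independently in this sketch.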

FIG. 6 is illustrative of another application of the present invention. In step 601 speech information is provided. The speech information comprises phonemes, duration of the phonemes and pitch information. Such speech information can be generated from text by a state of the art text-to-speech processing system.

From this speech information provided in step 601 the diphones are extracted in step 602. In step 603 the required diphone locations on the time axis and the pitch contour are determined based on the information provided in step 601.

In step 604 pitch bells are selected in accordance with the timing and pitch requirements as determined in step 603. The selected pitch bells are concatenated to provide a quasi natural speech output in step 605.

This procedure is further illustrated by means of an example as shown in FIGS. 7 to 9.

FIG. 7 shows a phonetic transcription of the sentence “HELLO WORLD!”. The first column 701 of the transcription contains the phonemes in the SAMPA standard notation. The second column 702 indicates the duration of the individual phonemes in milliseconds. The third column 703 comprises pitch information. A pitch movement is denoted by two numbers: position, as a percentage of the phoneme duration, and the pitch frequency in Hz.

The synthesis starts with the search in a previously generated database of diphones. The diphones are cut from real speech and consist of the transition from one phoneme to the other. All possible phoneme combinations for a certain language have to be stored in this database along with some extra information like the phoneme boundary. If there are multiple databases of different speakers, the choice of a certain speaker can be an extra input to the synthesizer.

FIG. 8 shows the diphones for the sentence “HELLO WORLD!”, i.e. all phoneme transitions in the column 701 of FIG. 7.

FIG. 9 shows the result of a calculation of the location of the phoneme boundaries, diphone boundaries and pitch period locations which are to be synthesized. The phoneme boundaries are calculated by adding the phoneme durations. For example the phoneme “h” starts after 100 ms of silence. The phoneme “schwa” starts after 155 ms=100 ms+55 ms, and so on.

The diphone boundaries are retrieved from the database as a percentage of the phoneme duration. Both the location of the individual phonemes as well as the diphone boundaries are indicated in the upper diagram 901 in FIG. 9, where the starting points of the diphones are indicated. The starting points are calculated based on the phoneme duration given by column 702 and the percentage of phoneme duration given in column 703.
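The boundary arithmetic of FIG. 9 can be sketched in Python as follows. Only the first two durations (100 ms of silence and 55 ms for “h”) are taken from the text; the third duration and all percentages in the usage lines are hypothetical placeholders, as are the function names.

```python
def phoneme_starts(durations_ms):
    """Start time of each phoneme, obtained by adding up the durations."""
    starts, t = [], 0
    for d in durations_ms:
        starts.append(t)
        t += d
    return starts


def boundary_positions(starts_ms, durations_ms, percentages):
    """Absolute position of a boundary stored as a percentage of each phoneme."""
    return [s + d * p / 100.0
            for s, d, p in zip(starts_ms, durations_ms, percentages)]


if __name__ == "__main__":
    durations = [100, 55, 80]          # silence, "h", then a made-up value
    starts = phoneme_starts(durations)
    print(starts)                      # the third entry is 100 + 55 = 155
    # Made-up percentages standing in for values retrieved from the database.
    print(boundary_positions(starts, durations, [50, 40, 25]))
```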

The diagram 902 of FIG. 9 shows the pitch contour of “HELLO WORLD!”. The pitch contour is determined based on the pitch information contained in the column 703 (cf. FIG. 7). For example, if the current pitch location is at 0.25 seconds then the pitch period would be at 50% of the first ‘l’ phoneme. The corresponding pitch lies between 133 and 139 Hz. It can be calculated with a linear equation:

$\begin{matrix}{\frac{(0.8 \cdot 63 + 0.5 \cdot 64) \cdot 133 + (0.2 \cdot 128 + 0.5 \cdot 64) \cdot 139}{0.8 \cdot 63 + 64 + 0.2 \cdot 128} = 135.5\ \text{Hz}} & (1)\end{matrix}$

The next pitch location would then be at 0.2500+1/135.5=0.2574 seconds. It is also possible to use a non-linear function (like the ERB-rate scale) for this calculation. The ERB (equivalent rectangular bandwidth) is a scale that is derived from psycho-acoustic measurements (Glasberg and Moore, 1990) and gives a better representation by taking into account the masking properties of the human ear. The formula for the frequency to ERB-transformation is:

$\begin{matrix}{\text{ERB}(f) = 21.4 \cdot \log_{10}(4.37 \cdot f)} & (2)\end{matrix}$

where f is the frequency in kHz. The idea is that the pitch changes in the ERB-rate scale are perceived by the human ear as linear changes.
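As a numerical check, equation (1) and the step to the next pitch location can be evaluated directly. The sketch below simply reproduces the arithmetic as it is written in the text; the function and variable names are hypothetical.

```python
def weighted_pitch(weight_a, pitch_a, weight_b, pitch_b, total):
    """Evaluate the weighted sum of equation (1) as written in the text."""
    return (weight_a * pitch_a + weight_b * pitch_b) / total


def next_pitch_location(t_seconds, pitch_hz):
    """The next pitch period starts one period (1/pitch) later."""
    return t_seconds + 1.0 / pitch_hz


if __name__ == "__main__":
    weight_133 = 0.8 * 63 + 0.5 * 64    # factor of 133 Hz in equation (1)
    weight_139 = 0.2 * 128 + 0.5 * 64   # factor of 139 Hz in equation (1)
    total = 0.8 * 63 + 64 + 0.2 * 128   # denominator of equation (1)
    pitch = weighted_pitch(weight_133, 133, weight_139, 139, total)
    print(round(pitch, 1))                              # 135.5
    print(round(next_pitch_location(0.25, 135.5), 4))   # 0.2574
```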

Note that unvoiced regions are also marked with pitch period locations even though unvoiced parts have no pitch.

The varying pitch given by the pitch contour in the diagram 902 is also illustrated within the diagram 901 by means of the vertical lines 903 which have varying distances. The greater the distance between two lines 903 the lower the pitch. The phoneme, diphone and pitch information given in the diagrams 901 and 902 is the specification for the speech to be synthesized. Diphone samples, i.e. pitch bells (cf. pitch bell 403 of FIG. 4), are taken from a diphone database. For each of the diphones a number of such pitch bells for that diphone is concatenated, with the number of pitch bells corresponding to the duration of the diphone and the distance between the pitch bells corresponding to the required pitch frequency as given by the pitch contour in the diagram 902.
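The placement of the vertical lines 903 can be sketched as a simple loop that advances by one pitch period at a time. This is a minimal illustration with hypothetical names; the pitch contour is assumed to be supplied by the caller as a function of time that returns a frequency in Hz, and how that function interpolates between the points of column 703 is left open here.

```python
def pitch_period_locations(contour_hz, start_s, end_s):
    """Return pitch period locations in seconds between start_s and end_s.

    `contour_hz(t)` is an assumed callable giving the desired pitch at time t.
    Each location lies one period, 1 / pitch, after the previous one."""
    locations = []
    t = start_s
    while t < end_s:
        locations.append(t)
        pitch = contour_hz(t)
        if pitch <= 0:
            raise ValueError("pitch contour must be positive")
        t += 1.0 / pitch
    return locations


if __name__ == "__main__":
    # Hypothetical contours: a constant pitch and a slowly falling pitch.
    print(len(pitch_period_locations(lambda t: 100.0, 0.0, 0.045)))
    falling = pitch_period_locations(lambda t: 140.0 - 100.0 * t, 0.0, 0.1)
    gaps = [b - a for a, b in zip(falling, falling[1:])]
    print(gaps[0] < gaps[-1])  # lower pitch, larger distance between lines
```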

The result of the concatenation of all pitch bells is a quasi natural synthesized speech. This is because phase related discontinuities at diphone boundaries are prevented by means of the present invention. This compares to the prior art where such discontinuities are unavoidable due to phase mismatches of the pitch periods.

Also the prosody (pitch/duration) is correct, as the duration of both sides of each diphone has been correctly adjusted. Also the pitch matches the desired pitch contour function.

FIG. 10 shows an apparatus 950, such as a personal computer, which has been programmed to implement the present invention. The apparatus 950 has a speech analysis module 951 which serves to determine the characteristic phase difference Δφ. For this purpose the speech analysis module 951 has a storage 952 in order to store one diphone speech wave. In order to obtain the constant phase difference Δφ a single diphone is sufficient.

Further the speech analysis module 951 has a low-pass filter module 953. The low-pass filter module 953 has a cut-off frequency of about 150 Hz, or another suitable cut-off frequency, in order to filter out the first harmonic of the diphone stored in the storage 952.

The module 954 of the apparatus 950 serves to determine the distance between a maximum energy location within a certain period of the diphone and its first harmonic zero phase location (this distance is transformed into the phase difference Δφ). This can be done by determining the phase difference between zero phase as given by the positive zero crossing of the first harmonic and the maximum of the diphone within that period of the harmonic, as has been illustrated in the example of FIG. 2.

As a result of the speech analysis the speech analysis module 951 provides the characteristic phase difference Δφ and thus, for all the diphones in the database, the period locations (on which e.g. the raised cosine windows are centered to get the pitch bells). The phase difference Δφ is stored in storage 955.

The apparatus 950 further has a speech synthesis module 956. The speech synthesis module 956 has storage 957 for storing of pitch bells, i.e. diphone samples which have been windowed by means of the window function as it is also illustrated in FIG. 2. It is to be noted that the content of the storage 957 does not necessarily have to be pitch bells. The whole diphones can be stored with period location information, or the diphones can be monotonized to a constant pitch. This way it is possible to retrieve pitch bells from the database by using a window function in the synthesis module.

The module 958 serves to select pitch bells and to adapt the pitch bells to the required pitch. This is done based on control information provided to the module 958.

The module 959 serves to concatenate the pitch bells selected in the module 958 to provide a speech output by means of module 960.

List of Reference Numerals

-   sound wave 201
-   first harmonic 202
-   maximum 203
-   maximum 204
-   sound wave 401
-   raised cosine 402
-   pitch bell 403
-   sound wave 404
-   column 701
-   column 702
-   column 703
-   diagram 901
-   diagram 902
-   apparatus 950
-   speech analysis module 951
-   storage 952
-   low pass filter module 953
-   module 954
-   storage 955
-   speech synthesis module 956
-   storage 957
-   module 958
-   module 959
-   module 960

1. A method, operable in a computer system, for analyzing of speech, the method causing the computer system to execute the acts of: inputting a speech signal; obtaining a first harmonic of the speech signal, determining a phase-difference (Δφ) between the speech signal and the first harmonic for centering a windowing function, wherein said phase difference is determined between a phase of a maximum amplitude of said speech signal and a phase zero of the first harmonic, wherein a zero-crossing of the first harmonic defines the phase zero of the first harmonic; and outputting the phase difference to a memory for storage.

2. The method of claim 1, wherein the determining comprises the act of determining a location of said maximum of the speech signal.

3. The method of claim 1, whereby the speech signal is a diphone signal.

4. A computer readable medium storing a computer program product which when loaded into a computer system causes the computer system to perform a method in accordance with claim 1.

5. The method of claim 1, wherein the zero-crossing is a positive zero-crossing.

6. The method of claim 1, further comprising the act of extracting diphones from the speech signal, wherein the obtaining act includes low-pass filtering of the diphones.

7. A method for synthesizing speech, the method, operable in a computer system, comprising the acts of: windowing by a window function diphone samples obtained from a speech signal; selecting the windowed diphone samples, wherein the window function is centered with respect to a phase angle which is determined as a phase difference between a phase of a maximum amplitude of said speech signal and a phase zero of a zero crossing of a first harmonic of the speech signal; and concatenating the selected windowed diphone samples to form the synthesized speech; and outputting the synthesized speech.

8. The method of claim 7, the speech signal being a diphone signal.

9. The method of claim 7, the window function being a raised cosine or a triangular window.

10. The method of claim 7 further comprising inputting of information being indicative of diphones and a pitch contour, the information forming the basis for selecting of the windowed diphone samples.

11. The method of claim 7, wherein the information is provided from a language processing module of a text-to-speech system.

12. The method of claim 7 further comprising the acts of: inputting of speech, and windowing the speech by the window function to obtain the windowed diphone samples.

13. The method of claim 7, wherein the window function is centered on the phase angle which is equal to the phase difference plus the phase zero.

14. The method of claim 7, wherein the window function is symmetric with respect to the phase angle.

15. The method of claim 7, wherein the window function and the diphone samples that are windowed are offset by the phase difference.

16. A speech analysis device for analyzing a speech signal comprising: a filter for obtaining a first harmonic of the speech signal, a processor for determining a phase difference (Δφ) between the speech signal and the first harmonic for centering a windowing function, wherein said phase difference is determined between a phase of a maximum amplitude of said speech signal and a phase zero (φ₀) of the first harmonic, wherein a zero-crossing of the first harmonic defines the phase zero.

17. The speech analysis device of claim 16, wherein the speech signal is a diphone signal.

18. A speech synthesis device comprising a processor configured for: selecting of windowed diphone samples of a speech signal, the diphone samples being windowed by a window function being centered with respect to a phase angle which is determined as a phase difference between the speech signal and a first harmonic of the speech signal, wherein said phase difference is determined between a phase of a maximum amplitude of said speech signal and a phase zero of the first harmonic of the speech, wherein a zero-crossing of the first harmonic defines the phase zero; and concatenating the selected windowed diphone signals.

19. The speech synthesis device of claim 18, wherein the speech signal is a diphone signal.

20. The speech synthesis device of claim 18 the window function being a raised cosine or a triangular window.

21. The speech synthesis device of claim 18, wherein the processor is further configured to receive information indicative of diphones and a pitch contour, and to select the windowed diphones based on the information.

22. A text-to-speech system comprising: a language processor for providing information being indicative of diphones and a pitch contour of a speech signal; and a speech synthesizer configured to: select windowed diphone samples based on the information, the diphone samples being windowed by a window function being centered with respect to a phase angle which is determined as a phase difference between a phase of a maximum amplitude of said speech signal and a first harmonic of the speech signal, wherein a zero-crossing of the first harmonic defines the phase zero; and concatenate the selected windowed diphone samples.

23. The text-to-speech system of claim 22, whereby the window function is a raised cosine or a triangular window.

24. A speech processing system comprising a processor configured to: receive a signal comprising natural speech signal, window the natural speech signal by a window function being centered with respect to a phase angle determined as a phase difference between a phase of a maximum amplitude of said natural speech signal and a phase zero of the first harmonic of the natural speech signal to provide windowed diphone samples, wherein a zero-crossing of the first harmonic defines the phase zero, process the windowed diphone samples, and concatenate the selected windowed diphone samples.